DISCUSSION: LATENT VARIABLE GRAPHICAL MODEL 
SELECTION VIA CONVEX OPTIMIZATION 

By Martin J. Wainwright 
University of California at Berkeley 

1. Introduction. It is my pleasure to congratulate the authors for an in- 
novative and inspiring piece of work. Chandrasekaran, Parrilo and Willsky 
(hereafter CPW) have come up with a novel approach, combining ideas from 
convex optimization and algebraic geometry, to the long-standing problem of 
Gaussian graphical model selection with latent variables. Their method is in- 
tuitive and simple to implement, based on solving a convex log-determinant 
program with suitable choices of regularization. In addition, they estab- 
lish a number of attractive theoretical guarantees that hold under high- 
dimensional scaling, meaning that the graph size p and sample size n are 
allowed to grow simultaneously. 

1.1. Background. Recall that an undirected graphical model (also known 
as a Markov random field) consists of a family of probability distributions 
that factorize according to the structure of undirected graph G = (V,E). In 
the multivariate Gaussian case, the factorization translates into a sparsity 
assumption on the inverse covariance or precision matrix [9]. In particular, 
given a multivariate Gaussian random vector $(X_1, \ldots, X_p)$ with covariance 
matrix $\Sigma$, it is said to be Markov with respect to the graph G if its precision 
matrix $K = \Sigma^{-1}$ has zeroes for each distinct pair of indices (j, k) not in the 
edge set E of the graph. Consequently, the sparsity pattern of the inverse 
covariance K encodes the edge structure of the graph. The goal of Gaussian 
graphical model selection is to determine this unknown edge structure, and 
hence the sparsity pattern of the inverse covariance matrix. It can also be 
of interest to estimate the matrices $K$ or $\Sigma$, for instance, in the Frobenius 
or $\ell_2$-operator norm sense. In recent years, under the assumption that all 
entries of X are fully observed, a number of practical methods have been 
proposed and shown to perform well under high-dimensional scaling (e.g., 
[2, 5-7]). 
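
As a small illustration of how the precision matrix encodes the graph, the sketch below uses a chain graph with an AR(1) covariance, an example chosen here and not taken from the discussion. The helper names and the values of p and rho are hypothetical. The covariance has no zero entries, whereas its inverse is tridiagonal, so the edges can be read off the precision matrix.

```python
# Sketch (illustrative assumption): a Gaussian chain X_1 - X_2 - ... - X_p
# with AR(1) covariance Sigma[j][k] = rho**|j-k|.

def ar1_covariance(p, rho):
    """Dense covariance: every entry is nonzero when rho != 0."""
    return [[rho ** abs(j - k) for k in range(p)] for j in range(p)]

def ar1_precision(p, rho):
    """Closed-form inverse of the AR(1) covariance, which is tridiagonal."""
    scale = 1.0 / (1.0 - rho ** 2)
    K = [[0.0] * p for _ in range(p)]
    for j in range(p):
        K[j][j] = scale * (1.0 if j in (0, p - 1) else 1.0 + rho ** 2)
        if j + 1 < p:
            K[j][j + 1] = K[j + 1][j] = -scale * rho
    return K

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def edge_set(K, tol=1e-12):
    """Edges (0-based index pairs) are the nonzero off-diagonal entries of K."""
    p = len(K)
    return {(j, k) for j in range(p) for k in range(j + 1, p) if abs(K[j][k]) > tol}

p, rho = 5, 0.6
Sigma = ar1_covariance(p, rho)
K = ar1_precision(p, rho)
product = matmul(K, Sigma)   # should be numerically the identity matrix
edges = edge_set(K)          # only consecutive indices are joined
print(sorted(edges))
```

The product check confirms that the tridiagonal matrix really is the inverse of the dense covariance, which is the population-level fact that graph selection methods exploit.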

Chandrasekaran et al. tackle a challenging extension of this problem, in 
which one observes only p coordinates of a larger (p + h)-dimensional Gaussian 
random vector. In this case, the $p \times p$ precision matrix K of the observed 
components need not be sparse, but rather, by an application of the Schur 
complement formula, can be written as the difference $K = S^* - L^*$. The first 
matrix $S^*$ is sparse, whereas the second matrix $L^*$ is not sparse (at least 
in general), but has rank at most h, corresponding to the number of latent 
or hidden variables. Consequently, the problem of latent Gaussian graphical 
model selection can be cast as a form of matrix decomposition, involving a 
splitting of the precision matrix into sparse and low-rank components. Based 
on this nice insight, CPW propose a natural M-estimator for this problem, 
based on minimizing a regularized form of the (negative) log likelihood for 
a multivariate Gaussian, where the elementwise $\ell_1$-norm is used as a proxy 
for sparsity, and the nuclear or trace norm as a proxy for rank. Overall, the 
method is based on the convex program 

(1)    $(\hat{S}, \hat{L}) \in \arg\min \{ -\ell(S - L; \Sigma^n) + \lambda_n (\gamma \|S\|_1 + \operatorname{trace}(L)) \}$ 

       such that $S \succ L \succeq 0$, 

where $\ell(S - L; \Sigma^n)$ is the Gaussian log-likelihood as a function of the preci- 
sion matrix $S - L$ and the empirical covariance matrix $\Sigma^n$ of the observed 
variables. 
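
For reference, the Schur complement step behind the sparse-minus-low-rank form can be written out as follows. The block notation $J$, $K_O$, $K_{OH}$, $K_H$ for the joint precision matrix of the observed and hidden variables, and $\Sigma_O$ for the marginal covariance of the observed variables, is introduced here for illustration and is not taken from the text above.

```latex
\[
J = \begin{pmatrix} K_{O} & K_{OH} \\ K_{HO} & K_{H} \end{pmatrix}
\in \mathbb{R}^{(p+h) \times (p+h)},
\qquad
K = \Sigma_{O}^{-1}
  = \underbrace{K_{O}}_{S^*}
  - \underbrace{K_{OH} K_{H}^{-1} K_{HO}}_{L^*} .
\]
\[
\operatorname{rank}(L^*) \le h,
\qquad
L^* \succeq 0 \quad \text{since } K_{H} \succ 0 .
\]
```

Under this reading, $S^*$ inherits the sparsity of the conditional graph among the observed variables, while $L^*$ summarizes the effect of marginalizing out the h hidden variables, which is consistent with the constraint $L \succeq 0$ in the program (1).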

1.2. Sharpness of rates. On one hand, the paper provides attractive 
guarantees on the procedure (1). Namely, under suitable incoherence 
conditions (to be discussed below) and a sample size $n \gtrsim p$, the method is 
guaranteed with high probability: (a) to correctly recover the signed support 
of the sparse matrix $S^*$, and hence the full graph structure; (b) to correctly 
recover the rank of the component $L^*$, and hence the number of latent vari- 
ables; and (c) to yield operator norm consistency of the order $\sqrt{p/n}$. The 
proof itself involves a clever use of the primal-dual witness method [6], in 
which one analyzes an M-estimator by constructing a primal solution and 
an associated dual pair, and uses the construction to show that the optimum 
has desired properties (in this case, support and rank recovery) with high 
probability. A major challenge, not present in the simpler problem without 
latent variables, is dealing with the potential nonidentifiability of the matrix 
decomposition problem (see below for further discussion); the authors over- 
come this challenge via a delicate analysis of the tangent spaces associated 
with the sparse and low-rank components. 

On the other hand, the scaling $n \gtrsim p$ is quite restrictive, at least in com- 
parison to related results without latent variables. To provide a concrete 
example, consider a Gaussian graphical model with maximum degree d. For 
any such graph, again under a set of so-called incoherence or irrepresentabil- 
ity conditions, the neighborhood-based selection approach of Meinshausen 
and Bühlmann [5] can be shown to correctly specify the graph structure with 
high probability based on $d \log p$ samples. Moreover, under a similar set 
of assumptions, Ravikumar et al. [6] show that the $\ell_1$-regularized Gaussian 
MLE returns an estimate of the precision matrix with operator norm error of 
the order $\sqrt{\frac{d^2 \log p}{n}}$. Consequently, whenever the maximum degree d is signif- 
icantly smaller than the dimension, results of this type allow for the sample 
size n to be much smaller than p. This discrepancy, as to whether or not the 
sample size can be smaller than the dimension, thus raises some interesting 
directions for future work. More precisely, one wonders whether or not the 
CPW analysis might be sharpened so as to reduce the sample size require- 
ments. Possibly this might require introducing additional structure in the 
low-rank matrix. From the other direction, an alternative approach would 
be to develop minimax lower bounds on latent Gaussian model selection, for 
instance, by using information-theoretic techniques that have been exploited 
in related work on model/graph selection and covariance estimation (e.g., 
[2, 8, 10]). 

1.3. Relaxing assumptions. The CPW analysis also imposes lower bounds 
on the minimum absolute values of the nonzero entries in $S^*$, as well as the 
minimum nonzero singular values of $L^*$; both must scale as $\Omega(\sqrt{p/n})$. Clearly, 
some sort of lower bound on these quantities is necessary in order to estab- 
lish exact recovery guarantees, as in the results (a) and (b) paraphrased 
above. It is less clear whether lower bounds of this order are the weakest 
possible, and if not, to what extent they can be relaxed. For instance, again 
in the setting of Gaussian graph selection without latent variables [5, 6], 
the minimum values are typically allowed to be as small as $\Omega(\sqrt{\frac{\log p}{n}})$. More 
broadly, in many applications, it might be more natural to assume that the 
data is not actually drawn from a sparse graphical model, but rather can 
be well-approximated by such a model. In such settings, although exact re- 
covery guarantees would no longer be feasible, one would like to guarantee 
that a given method, either the M-estimator (1) or some variant thereof, 
can recover all entries of $S^*$ with absolute value above a given threshold, 
and/or estimate the number of eigenvalues of $L^*$ above a (possibly differ- 
ent) threshold. Such guarantees are possible for ordinary Gaussian graph 
selection, where it is known that $\ell_1$-based methods will recover all entries 
with absolute values above the regularization parameter [5, 6]. 

The CPW analysis also involves various types of incoherence conditions 
on the matrix decomposition. As noted by the authors, some of these as- 
sumptions are related to the incoherence or irrepresentability conditions im- 
posed in past work on ordinary Gaussian graph selection [5, 6, 11]; others are 
unique to the latent problem, since they are required to ensure identifiability 
(see discussion below). It seems worthwhile to explore which of these inco- 
herence conditions are artifacts of a particular methodology and which are 
intrinsic to the problem. For instance, in the case of ordinary Gaussian graph 
selection, there are problems for which the neighborhood-based Lasso [5] 
can correctly recover the graph while the $\ell_1$-regularized log-determinant 
approach [4, 6] cannot. Moreover, there are problems for which, with the 
same order of sample size, the neighborhood-based Lasso will fail whereas 
an oracle method will succeed [10]. Such differences demonstrate that cer- 
tain aspects of the incoherence conditions are artifacts of $\ell_1$-relaxations. In 
the context of latent Gaussian graph selection, these same issues remain to 
be explored. For instance, are there alternative polynomial-time methods 
that can perform latent graph selection under milder incoherence condi- 
tions? What conditions are required by an oracle-type approach, that is, one 
involving exact cardinality and rank constraints? 

1.4. Toward partial identifiability. On the other hand, certain types of 
incoherence conditions are clearly intrinsic to the problem. Even at the pop- 
ulation level, it is clearly not possible in general to identify the components 
($S^*$, $L^*$) based on observing only the difference $K = S^* - L^*$. A major contribution 
of the CPW paper, building from their own pioneering work on matrix de- 
compositions [3], is to provide sufficient conditions on the pair (S*,L*) that 
ensure identifiability. These sufficient conditions are based on a detailed anal- 
ysis of the algebraic structure of the spaces of sparse and low-rank matrices, 
respectively. 

In a statistical setting, however, most models are viewed as approxima- 
tions to reality. With this mindset, it could be interesting to consider ma- 
trix decompositions that satisfy a weaker notion of partial identifiability. To 
provide a concrete illustration, suppose that we begin with a matrix pair 
($S^*$, $L^*$) that is identifiable based on observing the difference $K = S^* - L^*$. 
Now imagine that we perturb K by a matrix that is both sparse and low- 
rank, for instance, a matrix of the form $E = zz^T$ where z is a sparse vector. 
If we then consider the perturbed matrix $\widetilde{K} := K + \delta E = S^* - L^* + \delta E$ 
for some suitably small parameter $\delta$, the matrix decomposition is no longer 
identifiable. In particular, at the two extremes, we can choose between the 
decomposition $\widetilde{K} = (S^* + \delta E) - L^*$, where the matrix $S^* + \delta E$ is sparse, 
or the decomposition $\widetilde{K} = S^* - (L^* - \delta E)$, where the matrix $L^* - \delta E$ is 
low-rank. Note that this nonidentifiability holds regardless of how small we 
choose the scalar $\delta$. However, from a more practical perspective, if we relax 
our requirement of exact identification, then such a perturbation need not 
be a concern as long as $\delta$ is relatively small. Indeed, one might expect that it 
should be possible to recover estimates of the pair ($S^*$, $L^*$) that are accurate 
up to an error proportional to $\delta$. 
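
The perturbation argument above can be checked numerically. The following is a minimal sketch on a hypothetical 4 x 4 example: the particular sparse matrix, rank-one matrix, sparse vector z and perturbation size are all made up for illustration. It builds the two extreme decompositions and confirms that both reproduce the same perturbed matrix, while their sparse parts differ by exactly the perturbation size.

```python
def outer(z):
    """Rank-one matrix z z^T as a list of rows."""
    return [[a * b for b in z] for a in z]

def add(A, B, c=1.0):
    """Return A + c * B entrywise."""
    return [[a + c * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Hypothetical ingredients: a diagonal (sparse) S_star and a dense rank-one L_star.
S_star = [[2.0 if j == k else 0.0 for k in range(4)] for j in range(4)]
L_star = outer([0.5, 0.5, 0.5, 0.5])

z = [1.0, 0.0, 0.0, 0.0]   # sparse vector, so E = z z^T is both sparse and rank one
E = outer(z)
delta = 0.01               # perturbation size

K = add(S_star, L_star, -1.0)    # K = S_star - L_star
K_tilde = add(K, E, delta)       # perturbed matrix K + delta * E

# Decomposition A: absorb delta * E into the sparse component.
S_a, L_a = add(S_star, E, delta), L_star
# Decomposition B: absorb delta * E into the low-rank component.
S_b, L_b = S_star, add(L_star, E, -delta)

recon_a = add(S_a, L_a, -1.0)
recon_b = add(S_b, L_b, -1.0)

# Largest entrywise disagreement between the two candidate sparse parts.
gap = max(abs(a - b) for ra, rb in zip(S_a, S_b) for a, b in zip(ra, rb))
print(gap)
```

Both reconstructions agree with the perturbed matrix, so the data cannot distinguish them, yet the two answers are within the perturbation size of each other, which is the sense in which the decomposition is partially identifiable.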

In some of our own recent work [1], we have provided such guarantees for 
a related class of noisy matrix decomposition problems. In particular, we 
consider the observation model 

(2)    $Y = \mathfrak{X}(S^* - L^*) + W$, 

where $\mathfrak{X} : \mathbb{R}^{p \times p} \to \mathbb{R}^{n_1 \times n_2}$ is a known linear operator and $W \in \mathbb{R}^{n_1 \times n_2}$ is a 
noise matrix. In the simplest case, $\mathfrak{X}$ is simply the identity operator. Obser- 
vation models of this form (2) arise in robust PCA, sparse factor analysis, 
multivariate regression and robust covariance estimation. 

Instead of enforcing incoherence conditions sufficient for identifiability, 
the analysis is performed under related but milder conditions on the inter- 
action between $S^*$ and $L^*$. For instance, one way of controlling the radius 
of nonidentifiability is via control on the "spikiness" of the low-rank com- 
ponent, as measured by the ratio $\alpha(L^*) := \frac{p \|L^*\|_\infty}{|||L^*|||_F}$, where $\|\cdot\|_\infty$ denotes 
the elementwise absolute maximum and $|||\cdot|||_F$ denotes the Frobenius norm. 
For any nonzero $p \times p$ matrix, this spikiness ratio ranges between 1 
and p: 

• On one hand, it achieves its minimum value by a matrix that has all its 
entries equal to the same nonzero constant (e.g., $L^* = \mathbf{1}\mathbf{1}^T$, where $\mathbf{1} \in \mathbb{R}^p$ 
is a vector of all ones). 

• On the other hand, the maximum is achieved by a matrix that concen- 
trates all its mass in a single position (e.g., $L^* = e_1 e_1^T$, where $e_1 \in \mathbb{R}^p$ is 
the first canonical basis vector). 

Note that it is precisely this latter type of matrix that is troublesome in 
sparse plus low-rank matrix decomposition, since it is simultaneously sparse 
and low-rank. In this way, the spikiness ratio limits the effect of such trou- 
blesome instances, thereby bounding the radius of nonidentifiability of the 
model. The paper [1] analyzes an M-estimator, also based on elementwise 
$\ell_1$ and nuclear norm regularization, for estimating the pair ($S^*$, $L^*$) from the 
noisy observation model (2). The resulting error bounds involve both terms 
arising from the (possibly stochastic) noise matrix W and additional terms 
associated with the radius of nonidentifiability. 
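
The two extreme cases of the spikiness ratio can be verified directly. The sketch below assumes the normalization alpha(L) = p * max|L_jk| / ||L||_F for a p x p matrix, which is the scaling consistent with the stated range from 1 to p. The function name and the choice p = 6 are illustrative.

```python
import math

def spikiness(L):
    """Assumed form: alpha(L) = p * max|L_jk| / ||L||_F for a nonzero p x p matrix."""
    p = len(L)
    max_abs = max(abs(x) for row in L for x in row)
    frob = math.sqrt(sum(x * x for row in L for x in row))
    return p * max_abs / frob

p = 6
# All entries equal: the least spiky case, the all-ones matrix.
ones = [[1.0] * p for _ in range(p)]
# All mass in one position: the spikiest case, e_1 e_1^T.
spike = [[1.0 if (j == 0 and k == 0) else 0.0 for k in range(p)] for j in range(p)]

alpha_flat = spikiness(ones)    # attains the lower end of the range
alpha_spike = spikiness(spike)  # attains the upper end of the range
print(alpha_flat, alpha_spike)
```

A bound on this ratio therefore rules out low-rank components that look like the single-spike matrix, which is exactly the kind of matrix that is also sparse.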

The same notion of partial identifiability is applicable to latent Gaussian 
graph selection. Accordingly, it seems worthwhile to explore whether similar 
techniques can be used to obtain error bounds with a similar form, with one 
component associated with the stochastic noise (induced by sampling), and 
a second deterministic component. Interestingly, under the scaling $n \gtrsim p$ 
assumed in the CPW paper, the empirical covariance matrix $\Sigma^n$ will be 
invertible with high probability and, hence, it can be cast as an observation 
model of the form (2); namely, we can write $(\Sigma^n)^{-1} = S^* - L^* + W$, where 
the noise matrix W is induced by sampling. 



1 Here we follow the notation of the CPW paper for the sparse and low-rank components. 

1.5. Extensions to non-Gaussian variables. A final more speculative yet 
intriguing question is whether the techniques of CPW can be extended to 
graphical models involving non-Gaussian variables, for instance, those with 
binary or multinomial variables for a start. The main complication here is 
that factorization and conditional independence properties for non-Gaussian 
variables do not translate directly into sparsity of the inverse covariance ma- 
trix. Nonetheless, it might be possible to reveal aspects of this factorization 
by some type of spectral analysis, in which context related matrix-theoretic 
approaches could be brought to bear. Overall, we should all be thankful to 
Chandrasekaran, Parrilo and Willsky for their innovative work and the ex- 
citing line of questions and possibilities that it has raised for future research. 